Abstract


In this paper, we consider the distributed stochastic multi-armed bandit problem, 
where a global arm set can be accessed by multiple players independently. The 
players are allowed to exchange their history of observations with each other at 
specific points in time. We study the relationship between regret and communication.
When the time horizon is known, we propose the Over-Exploration strategy,
which only requires one-round communication and whose regret does not scale
with the number of players. When the time horizon is unknown, we measure the
frequency of communication through a new notion called the density of the communication
set, and give an exact characterization of the interplay between regret
and communication. Specifically, a lower bound is established and stable strategies
that match the lower bound are developed. The results and analyses in this
paper are specific but can be translated into more general settings. 

1 Introduction 

We consider the distributed stochastic multi-armed bandit (MAB) problem, where a global arm
set $\mathcal{A} = [K]$ can be accessed by $M$ players independently. Each arm $a \in \mathcal{A}$ is associated with
an unknown yet fixed probability distribution $\nu_a$ that belongs to a known family $\mathcal{P}$. The process
proceeds in rounds. At the beginning of each round $t$, each player $p \in [M]$ pulls an arm $A_{p,t} \in \mathcal{A}$
based on some policy and independently receives a reward $X_{p,t} \sim \nu_{A_{p,t}}$. Some rounds are
communication rounds, at the end of which every player knows everything the other players know. Note
that when $M = 1$, the process reduces to a traditional MAB process.

The goal is to minimize the regret. We denote by $\mu_a$ the mean of the distribution $\nu_a$ and let
$\mu^* = \max_a \mu_a$. The regret after $T$ rounds is defined by

$$R_T := TM\mu^* - E\left[\sum_{t=1}^{T}\sum_{p=1}^{M} X_{p,t}\right].$$

We are only interested in consistent policies. A policy is said to be consistent if for any $c > 0$,
$R_T = o(T^c)$ always holds. If we further let $\Delta_a = \mu^* - \mu_a$ and $N_T(a)$ be the number of times arm
$a$ has been pulled by all the players in the first $T$ rounds, then it suffices to bound $E[N_T(a)]$ for all $a$
such that $\mu_a \neq \mu^*$, since the regret can be written as $\sum_{a \in \mathcal{A}} \Delta_a E[N_T(a)]$.
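The decomposition above can be checked numerically. The following is a minimal sketch; the function name and the sample numbers are hypothetical and only illustrate that the regret is the gap-weighted sum of expected pull counts.

```python
def regret_from_pulls(means, expected_pulls):
    """Regret as sum over arms of gap(a) * E[N_T(a)].

    means[a] is the mean reward of arm a; expected_pulls[a] is the expected
    number of pulls of arm a by all players together (illustrative inputs).
    """
    best = max(means)
    return sum((best - mu) * n for mu, n in zip(means, expected_pulls))


# Two arms with means 0.9 and 0.8; the suboptimal arm is pulled 100 times.
example_regret = regret_from_pulls([0.9, 0.8], [1900.0, 100.0])
```

Only the suboptimal arm contributes, which is why bounding $E[N_T(a)]$ for suboptimal arms is enough.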

1.1 Related Work 

The traditional single-player MAB problem has been studied for a long time. The establishment
of the lower bounds can be traced back to [?], in which some asymptotically optimal policies based on
the notion of upper confidence bound (UCB) are also developed. Later contributions mainly focus
on finite-time analysis of UCB-like algorithms [1], [?]. Recently, a Bayesian approach called
Thompson sampling [?] has also been proved to be asymptotically optimal.

Although there have been regret analyses of single-player models with delayed feedback [?], regret
analysis of distributed MAB models remains largely unaddressed in the literature. The model used in
[?] is very similar to ours, but they focused on best arm identification, which is mainly an exploration
problem. In [?], a P2P-like gossip-based model is proposed. However, their policy is based on the
$\epsilon$-GREEDY algorithm [?], which itself requires a lower bound on the gap between the best arm and
the second-best arm. Furthermore, [?] only considered the case that the number of communication
rounds grows linearly with $T$. There are also other distributed models in which players compete
with each other [?] or an adversarial setting is considered [?].

2 Oblivious Bandit Policies 

In this section we first propose a general framework for a large number of MAB policies. Then we 
show how this framework enables us to develop communication strategies independent of the bandit 
policy used. 

Definition 1 (Oblivious Bandit Policy). Given $K$ finite sets¹ of rewards $X_1, X_2, \ldots, X_K$ generated
by $K$ arms, an oblivious bandit policy chooses the next arm based only on these $K$ sets.

In order to explain how current bandit policies can be translated into this framework and further
adapted into a distributed setting, we require some additional notation. Given a finite set of rewards
$X_a$ generated by arm $a$, the empirical distribution with respect to $X_a$ is defined by

$$\hat{\nu}(X_a) := \frac{1}{|X_a|} \sum_{x \in X_a} \delta(x),$$

where $\delta(\cdot)$ is the Dirac delta function. For every player $p$, every round $t$, and every arm $a$, we denote
by $X_{p,t}(a)$ the set of rewards generated by arm $a$ that are available to player $p$ at the end of round $t$
and let $N_{p,t}(a) = |X_{p,t}(a)|$. Clearly, if $t$ is a communication round, then $N_t(a) = N_{p,t}(a)$ for every
$p$ and $a$. For simplicity, we use $E[\nu]$ to denote the mean of distribution $\nu$ and denote $E[\hat{\nu}(X_{p,t}(a))]$ by
$\hat{\mu}_{p,t}(a)$. We also use $\mathcal{B}(p)$ to denote a Bernoulli distribution with parameter $p$, and $D_{\mathrm{KL}}(\nu_1 \| \nu_2)$ to
denote the Kullback-Leibler (KL) divergence of $\nu_2$ from $\nu_1$. The following are two oblivious bandit
policies that have been adapted into a distributed setting (and will be mainly discussed in this paper).

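UCB adaptation for [0,1] bounded rewards. Let $F(t) = \ln(t \ln^3(t))$. $A_{p,t} = \arg\max_a B^+_{p,t}(a)$,
where

$$B^+_{p,t}(a) = \hat{\mu}_{p,t-1}(a) + \sqrt{\frac{F\left(\sum_{k=1}^{K} N_{p,t-1}(k)\right)}{2 N_{p,t-1}(a)}}.$$

KL-UCB adaptation for Bernoulli rewards. Let $F(t) = \ln(t \ln^3(t))$. $A_{p,t} = \arg\max_a B^+_{p,t}(a)$,
where

$$B^+_{p,t}(a) = \sup\left\{ q \in (0,1) : D_{\mathrm{KL}}\left(\mathcal{B}(\hat{\mu}_{p,t-1}(a)) \,\|\, \mathcal{B}(q)\right) \le \frac{F\left(\sum_{k=1}^{K} N_{p,t-1}(k)\right)}{N_{p,t-1}(a)} \right\}.$$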

There are many other oblivious bandit policies in the literature such as most UCB-like policies 
and Thompson sampling. These policies are called oblivious because their choice of the next arm 
depends only on the empirical distribution of the data and the number of data collected. They do 
not rely on a timer or other information such as the player’s own previous choices or other players’ 
previous choices. Note that there are non-oblivious policies such as the DMED policy [?].
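To make the notion of an oblivious policy concrete, here is a minimal Python sketch of the two index computations. It is an illustration under stated assumptions, not the authors' implementation: the confidence level is evaluated at the total number of samples available to the player (suggested by the footnote on the experiment), the factor 2 in the UCB width follows the usual Pinsker relaxation of KL-UCB, and the clamping of $F$ for very small arguments, the bisection tolerance, and all function names are choices made here.

```python
import math


def confidence_level(n):
    """F(n) = ln(n ln^3 n) = ln n + 3 ln ln n, clamped at 0 for tiny n (assumption)."""
    if n < 2:
        return 0.0
    return max(0.0, math.log(n) + 3.0 * math.log(math.log(n)))


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def ucb_index(mean, count, level):
    """Empirical mean plus a sqrt(level / (2 * count)) confidence width."""
    return mean + math.sqrt(level / (2.0 * count))


def kl_ucb_index(mean, count, level, tol=1e-9):
    """Largest q >= mean with KL(B(mean) || B(q)) <= level / count, found by bisection."""
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) <= level / count:
            lo = mid
        else:
            hi = mid
    return lo


def choose_arm(rewards_by_arm, index_fn):
    """Oblivious choice: the decision depends only on the K reward multisets."""
    for arm, rewards in enumerate(rewards_by_arm):
        if not rewards:
            return arm  # pull every arm once before using indices
    level = confidence_level(sum(len(r) for r in rewards_by_arm))
    scores = [index_fn(sum(r) / len(r), len(r), level) for r in rewards_by_arm]
    return max(range(len(scores)), key=scores.__getitem__)
```

Note that `choose_arm` takes no round number and no record of who pulled what, which is exactly what lets the same policy run unchanged on data merged from several players.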

3 Distributed Bandits: A Paradox 

It is often believed that the more you know, the better you will do. Translating into bandit language, 
the more you communicate, the lower the regret is. However, the following example shows this is 
not necessarily true. 

¹ Actually these sets are multisets.
Figure 1: Adding communication rounds may increase the regret. Strategy A communicates only
one time ($t = 2^{12}$). Strategy B communicates three times ($t = 2^4, 2^8, 2^{12}$). Strategy C communicates
over four thousand times ($t = 1, 2, \ldots, 2^{12}$). Results are based on $10^4$ independent runs.


Example 1. Suppose there are 2 players, 2 arms, and a total of $2^{16}$ rounds. Arm 1 is associated
with $\mathcal{B}(0.9)$ and arm 2 is associated with $\mathcal{B}(0.8)$. Both players use the UCB adaptation described
in Section 2. Consider the following three communication strategies: (A) communicate once when
$t = 2^{12}$; (B) communicate three times when $t = 2^4, 2^8, 2^{12}$; (C) communicate $2^{12}$ times when
$t = 1, 2, \ldots, 2^{12}$.
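A setup like the one in this example can be simulated with a short script. The sketch below is an assumption-laden illustration rather than the code behind the figure: it evaluates the confidence level at the number of samples a player holds, clamps it for tiny sample sizes, breaks ties by arm order, and uses hypothetical names throughout.

```python
import math
import random


def confidence_level(n):
    """F(n) = ln n + 3 ln ln n, clamped at 0 for tiny n (assumption)."""
    if n < 2:
        return 0.0
    return max(0.0, math.log(n) + 3.0 * math.log(math.log(n)))


def simulate(horizon, means, num_players, comm_rounds, seed=0):
    """Run distributed UCB with Bernoulli arms; return total pulls per arm."""
    rng = random.Random(seed)
    k = len(means)
    shared = [[0, 0.0] for _ in range(k)]  # [count, reward sum] known to everyone
    private = [[[0, 0.0] for _ in range(k)] for _ in range(num_players)]
    pulls = [0] * k
    for t in range(1, horizon + 1):
        for p in range(num_players):
            counts = [shared[a][0] + private[p][a][0] for a in range(k)]
            sums = [shared[a][1] + private[p][a][1] for a in range(k)]
            if 0 in counts:
                arm = counts.index(0)  # try every arm once
            else:
                level = confidence_level(sum(counts))
                arm = max(range(k), key=lambda a: sums[a] / counts[a]
                          + math.sqrt(level / (2.0 * counts[a])))
            reward = 1.0 if rng.random() < means[arm] else 0.0
            private[p][arm][0] += 1
            private[p][arm][1] += reward
            pulls[arm] += 1
        if t in comm_rounds:
            # Communication round: everyone learns everything.
            for p in range(num_players):
                for a in range(k):
                    shared[a][0] += private[p][a][0]
                    shared[a][1] += private[p][a][1]
                    private[p][a] = [0, 0.0]
    return pulls


def run_example(horizon=2 ** 16, seed=0):
    """Suboptimal-arm pulls for Strategies A, B, C (a single run, not an average)."""
    strategies = {
        "A": {2 ** 12},
        "B": {2 ** 4, 2 ** 8, 2 ** 12},
        "C": set(range(1, 2 ** 12 + 1)),
    }
    return {name: simulate(horizon, [0.9, 0.8], 2, rounds, seed)[1]
            for name, rounds in strategies.items()}
```

Calling `run_example()` gives one sample per strategy; averaging over many seeds would be needed before comparing the result with the curves in Figure 1.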


Intuitively, Strategy A should be the worst and Strategy C should be the best. However, the numerical
experiment gives a surprising result. Figure 1 shows the average number of pulls of the suboptimal
arm as a function of time (on a logarithmic scale) for Strategies A, B, C.²

To explain this phenomenon, first recall that almost all single-player regret analyses
[?] go as follows:


Once a suboptimal arm has been pulled more than $\xi \ln(T)$ times, it will almost
never be pulled any more before round $T$.


Here $\xi$ is a constant depending on the bandit algorithm.³ Following this argument, we first consider
Strategy C. It keeps communicating until $t = 2^{12}$. Then according to the full-communication curve,
approximately 200 additional explorations of the suboptimal arm are needed before $t = 2^{16}$. However,
the two players have to do the 200 explorations separately since they cannot communicate
any more. Therefore this suboptimal arm is actually explored $2 \cdot 200 = 400$ times from $t = 2^{12}$
to $t = 2^{16}$, resulting in a much higher final regret compared to the full-communication strategy,
which only explores the suboptimal arm 200 times during this time period. More generally, after a
communication round, if an additional $\Delta$ explorations of the suboptimal arm are needed before the
next communication round, the actual number of explorations performed would be $M \cdot \Delta$. That is, due to lack
of communication, $(M - 1) \cdot \Delta$ unnecessary explorations are performed, resulting in a larger regret.

On the other hand, Strategy A cleverly makes $\Delta$ very close to 0, forcing the final regret almost as low
as that of the full-communication curve. When Strategy A finishes its only communication, each
of the players has already collected approximately 500 independent samples of the suboptimal arm,
which is an over-exploration when $t = 2^{12}$, since the full-communication curve indicates that when
$t = 2^{12}$, only approximately 300 explorations of the suboptimal arm are needed. However, this
amount of exploration happens to be just enough for $t = 2^{16}$ according to the full-communication
curve. Therefore, when $t$ goes from $2^{12}$ to $2^{16}$, the suboptimal arm is rarely pulled.

Finally, consider Strategy B. After each communication, the “over-exploration” phenomenon occurs,
and the curve acts like that of Strategy A. After some period of time, the amount of exploration
goes back to the normal level. Then it will act like Strategy C, resulting in an over-exploration again
before the next communication comes. Strategy B performs communication in a relatively stable
way. That is, it always keeps $\Delta$ relatively small, so that its curve does not fluctuate dramatically
around the full-communication curve.


² In this experiment, we use $\ln(2t)$ to approximate $\ln\left(\sum_{k=1}^{K} N_{p,t-1}(k)\right) + 3\ln\ln\left(\sum_{k=1}^{K} N_{p,t-1}(k)\right)$
for a better comparison when $T$ is relatively small.

³ E.g., $\xi = 1/(2\Delta_a^2)$ in UCB and $\xi = 1/D_{\mathrm{KL}}(\mathcal{B}(\mu_a) \| \mathcal{B}(\mu^*))$ in KL-UCB.

Both Strategy A and Strategy B suggest good ways to develop communication strategies.
Strategy A indicates that we can find a time point such that the over-exploration is “just enough.” In
Section 4 we give the corresponding theoretical guarantees. However, this strategy requires a known
time horizon, and the right time for communication is hard to predict for small $T$ and large
$M$. Fortunately, Strategy B indicates that we can develop anytime strategies that require very few
communication rounds with very good performance. The corresponding theoretical guarantees are
given in Section 5. However, if the communication rounds are too few, there is no way we can
use an oblivious bandit policy to achieve a good performance. In Section 5.2 we introduce a
non-oblivious bandit policy called DKLUCB (which stands for Distributed KL-UCB) which is asymptotically
optimal. Most of the results and analyses in Section 4 and Section 5 suppose the KL-UCB adaptation
described in Section 2 is used. Similar results also hold for the UCB adaptation (with a different
constant) with only slight modification (even simplification) of our analyses. In particular, DKLUCB
can be easily modified to get a UCB version called DUCB, which stands for Distributed UCB.⁴


4 The Over-Exploration Strategy 


In this section, we suppose that a time horizon $T$ is given. That is, the regret is only evaluated
at the end of round $T$. As we have discussed in Section 3, we want to find the right time for
communication such that the resulting over-exploration is “just enough.” The following theorem
tells us the right time is around $T^{1/M}$.

Theorem 2. If the rewards are Bernoulli rewards and the only communication round is round
$\lceil T^{1/M} \rceil$, then for every suboptimal arm $a$,

$$E[N_T(a)] \le \frac{\ln(T)}{D_{\mathrm{KL}}(\mathcal{B}(\mu_a) \| \mathcal{B}(\mu^*))} + o(\ln(T)).$$


In other words, using the over-exploration strategy, asymptotically the regret does not scale with the 
number of players. Strangely, to prove an upper bound, we must prove a lower bound first. In fact, 
the following lemma is critical to our proof. 

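Lemma 3. If the rewards are Bernoulli rewards and the only communication round is round
$\lceil T^{1/M} \rceil$, then for every suboptimal arm $a$ and any $\delta > 0$,

$$\lim_{T \to \infty} \Pr\left( N_{\lceil T^{1/M} \rceil}(a) > \frac{(1-\delta)\ln(T)}{D_{\mathrm{KL}}(\mathcal{B}(\mu_a) \| \mathcal{B}(\mu^*))} \right) = 1.$$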


This lemma ensures that we over-explore the suboptimal arms enough when $t = \lceil T^{1/M} \rceil$. One way to
prove this is to use Theorem 2 in [11]. However, this has some disadvantages. First, it cannot be
applied to the UCB adaptation. Second, it requires that we first prove the policy is consistent, which
is unnecessary. Finally, it cannot be translated into a finite-time result. Hence, we present a
direct algorithm-oriented proof here. The key idea is, in order to prove that something happens with a
very low probability, we instead prove that its consequence happens with a very low probability.


Proof sketch of Lemma 3. Let $\xi$ be a shorthand for $1 / D_{\mathrm{KL}}(\mathcal{B}(\mu_a) \| \mathcal{B}(\mu^*))$. It suffices to prove that
in a single-player setting,⁵ $\lim_{T \to \infty} \Pr(N_T(a) < (1-\delta)\xi\ln(T)) = 0$, then use a union bound to
get the desired result. Define the random variable $\Gamma_T$ to be the arm that is pulled most frequently
before round $T$ and $\Psi_T$ to be the last round before round $T$ in which $\Gamma_T$ is pulled. Then

$$\Pr(N_T(a) < (1-\delta)\xi\ln(T)) \le \Pr\left(N_T(a) < (1-\delta)\xi\ln(T) \wedge B^+_{\Psi_T}(\Gamma_T) \ge B^+_{\Psi_T}(a)\right)$$

$$\le \Pr\left(B^+_{\Psi_T}(\Gamma_T) > \mu^* + \epsilon\right) + \Pr\left(\hat{\mu}_{\Psi_T}(a) < \mu_a - \epsilon\right) \qquad (4.1)$$

$$+ \Pr\left(D_{\mathrm{KL}}(\mathcal{B}(\mu_a - \epsilon) \| \mathcal{B}(\mu^* + \epsilon)) > \frac{F(\Psi_T)}{(1-\delta)\xi\ln(T)}\right) \quad (\epsilon \text{ chosen to make this term equal } 0).$$

By definition we have $N_{\Psi_T}(\Gamma_T) \ge (T-1)/M$, therefore $B^+_{\Psi_T}(\Gamma_T) \to \mu_{\Gamma_T}$. Now we
need to show $N_T(a) \to \infty$ in probability. Then, using the fact that $\Psi_T \ge (T-1)/M$ as well as
Hoeffding’s inequality, we can prove that the two terms in (4.1) converge to 0 as $T$ goes to infinity. □

⁴ As explained in [?], UCB is actually a simple relaxation of KL-UCB.

⁵ Since we are talking about a single-player setting, here and later, the subscript $p$ is dropped.

Then we can start our proof of Theorem 2. The key observation is to define random time points
$\Lambda_{p,T}$, bound the count after $\Lambda_{p,T}$ using a standard bandit argument, and then bound the
count before $\Lambda_{p,T}$ using Lemma 3.

Proof sketch of Theorem 2. Let $\delta > 0$ be an arbitrarily small number, $\xi$ be a shorthand for
$1 / D_{\mathrm{KL}}(\mathcal{B}(\mu_a) \| \mathcal{B}(\mu^*))$, and $T_0$ be a shorthand for $\lceil T^{1/M} \rceil$. We define random variables $\Phi_T$ and $\Psi_T$
in the following way: if $N_{T_0}(a) > (1-\delta)\xi\ln(T)$, then $\Phi_T = 0$ and $\Psi_T = T_0$; otherwise $\Phi_T = T_0$
and $\Psi_T = T$. We also define random variables
$\Lambda_{p,T} = \max\{t : \Phi_T \le t \le \Psi_T,\ N_{p,t}(a) < (1-\delta)\xi\ln(\Psi_T)\}$.
It can be checked that $\Lambda_{p,T}$ is well-defined. By a standard bandit argument, for
each player $p$ the expected number of pulls of arm $a$ after $\Lambda_{p,T}$ is no more than $4\delta\xi\ln(T) + o(\ln(T))$,
which is negligible. The remaining expected number of pulls is no more than

$$M \cdot E[N_{p,\Lambda_{p,T}}(a)] \le M \cdot \left(\Pr(\Phi_T = 0)(1-\delta)\xi\ln(T_0) + \Pr(\Phi_T = T_0)(1-\delta)\xi\ln(T)\right) \qquad (4.2)$$

$$\text{(by Lemma 3)} \le \xi(1-\delta) M \ln(\lceil T^{1/M} \rceil) + o(1) \cdot M(1-\delta)\xi\ln(T)$$

$$= \xi(1-\delta)\ln(T) + o(\ln(T)). \qquad \Box$$


5 Stable Strategies 


While the over-exploration strategy works like magic, it has two disadvantages. First, it requires
a known time horizon, while in practice we often need an anytime algorithm. Second, when $T$ is
relatively small, almost all the bandit policies do much better than the upper bound suggests, which
makes the choice of $T^{1/M}$ smaller than actually needed. In other words, Lemma 3 would only make
sense when $T^{1/M}$ is relatively large, which may require a huge $T$ if $M$ is large.
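How quickly $T^{1/M}$ shrinks as $M$ grows is easy to see numerically. The helper below is a small sketch with a hypothetical name; it computes $\lceil T^{1/M} \rceil$ with integer arithmetic to avoid floating-point rounding at exact powers.

```python
def over_exploration_round(horizon, num_players):
    """Smallest integer r with r ** num_players >= horizon, i.e. ceil(T^(1/M))."""
    r = max(1, round(horizon ** (1.0 / num_players)))
    while r ** num_players < horizon:
        r += 1
    while r > 1 and (r - 1) ** num_players >= horizon:
        r -= 1
    return r


# With T = 2**16 and M = 2 the single communication round is 2**8, while
# with T = 10**6 and M = 10 it is already as early as round 4.
two_players = over_exploration_round(2 ** 16, 2)
ten_players = over_exploration_round(10 ** 6, 10)
```

For the two-player setting of Example 1 this gives $2^8$, earlier than the round $2^{12}$ that worked well empirically, which matches the remark that $T^{1/M}$ can be smaller than actually needed when $T$ is small.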

To develop anytime algorithms, we first introduce the concept of communication set. The communication
set is the set of all the communication rounds. Since we want to develop anytime algorithms,
we assume the communication set $C$ is an infinite set whose elements are denoted as $C_1, C_2, C_3, \ldots$
in increasing order. Then we need to measure the frequency of the communication set.
Definition 4. The counting function on a communication set $C$ is defined by $Z_C(n) := |C \cap [n]|$.
Definition 5. The density of a communication set $C$ is defined by $\alpha(C) := \liminf_{k \to \infty} \frac{\ln(C_k)}{\ln(C_{k+1})}$.


The counting function is a natural and intuitive way to define the frequency of communication. It
basically tells us how many communication rounds there are in the first $n$ rounds for any $n$. However,
as we will show later, the density is the true property of a communication set in the bandit world.
The relationship between these two measurements can be summarized by the following proposition.
Proposition 6. For every communication set $C$, (a) if $Z_C(n) \in o(\ln(\ln(n)))$, then $\alpha(C) = 0$; (b) if
$\alpha(C) = 1$, then $Z_C(n) \in \omega(\ln(\ln(n)))$; (c) if $0 < \alpha(C) < 1$, then

$$\liminf_{n \to \infty} \frac{Z_C(n)}{\ln(\ln(n)) / \ln(\alpha(C)^{-1})} \ge 1.$$

In fact, Proposition 6-(a) follows directly from Proposition 6-(b) and (c). As we will show later, a
higher density leads to a lower regret. Proposition 6-(b) says that in order to achieve the highest
density, or equivalently the lowest regret, the number of communication rounds inevitably falls
into the class $\omega(\ln(\ln(n)))$. On the other hand, Proposition 6-(c) says that if we are aiming at a
density greater than 0 and less than 1, or equivalently a regret that is “not bad”, then the number of
communication rounds should be at least in the order of $\ln(\ln(n)) / \ln(\alpha(C)^{-1})$.
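The two measurements can be explored on finite prefixes of a communication set. The sketch below uses hypothetical helper names and assumes the prefix is given as a sorted list; since the density is a limit inferior, the ratios of a finite prefix can only suggest its value.

```python
import math
from bisect import bisect_right


def counting_function(comm_rounds, n):
    """Z_C(n): number of communication rounds among the first n rounds."""
    return bisect_right(comm_rounds, n)


def log_ratios(comm_rounds):
    """Ratios ln(C_k) / ln(C_{k+1}) for consecutive elements of a sorted prefix."""
    return [math.log(a) / math.log(b)
            for a, b in zip(comm_rounds, comm_rounds[1:]) if b > 1]


exponential_grid = [2 ** k for k in range(1, 21)]        # ratios k / (k + 1)
double_exponential = [2 ** (2 ** k) for k in range(1, 6)]  # ratios stay at 1/2
```

On the exponential prefix the ratios climb towards 1, while on the double-exponential prefix they stay at 1/2, and `counting_function` shows how few rounds the sparser set uses.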


Now we can use the concept of density to establish a lower bound. 

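Theorem 7. Let $D_{\inf}(\nu, a, \mathcal{P}) := \inf_{\nu' \in \mathcal{P} : E[\nu'] > a} D_{\mathrm{KL}}(\nu \| \nu')$, $C$ be the communication set, and $\pi$ be
a consistent bandit policy. Then for every suboptimal arm $a$ satisfying $0 < D_{\inf}(\nu_a, \mu^*, \mathcal{P}) < \infty$,

$$\limsup_{T \to \infty} \frac{E[N_T^{\pi}(a)]}{\ln(T)} \ge \frac{M}{1 + (M-1)\alpha(C)} \cdot \frac{1}{D_{\inf}(\nu_a, \mu^*, \mathcal{P})}.$$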
To prove Theorem 7, we need to translate any distributed bandit into a single-player bandit, then
perform a reduction from the single-player lower bound, and finally calculate the new constant.


Proof sketch of Theorem 7. First, for each player $p$, we can use Theorem 1 in [?] to show that

$$\liminf_{t \to \infty} \frac{E[N_{p,t}(a)]}{\ln(t)} \ge \frac{1}{D_{\inf}(\nu_a, \mu^*, \mathcal{P})}. \qquad (5.1)$$

This can be done by relabeling the round numbers in a distributed bandit and translating it into a serialized
process, in which a player periodically changes his “role” so that he does not make decisions based
on all the data available in this serialized process. Then it remains to do some calculation. By the
definition of $\alpha(C)$, there exists a subsequence of $(C_k)_{k \ge 1}$, denoted by $(C_{n_s})_{s \ge 1}$, such that

$$\lim_{s \to \infty} \ln(C_{n_s}) / \ln(C_{n_s+1}) = \alpha(C). \qquad (5.2)$$

Let $x = \limsup_{T \to \infty} E[N_T(a)] / \ln(T)$. If $x = \infty$, the desired inequality holds trivially. Otherwise,

$$x \ge \limsup_{s \to \infty} \frac{E[N_{C_{n_s+1}-1}(a)]}{\ln(C_{n_s+1})} \ge \liminf_{s \to \infty} \frac{\sum_{p=1}^{M} E[N_{p,C_{n_s+1}-1}(a)] - (M-1) \cdot E[N_{C_{n_s}}(a)]}{\ln(C_{n_s+1})}$$

$$\ge \sum_{p=1}^{M} \liminf_{s \to \infty} \frac{E[N_{p,C_{n_s+1}-1}(a)]}{\ln(C_{n_s+1})} - (M-1) \cdot \limsup_{s \to \infty} \frac{E[N_{C_{n_s}}(a)]}{\ln(C_{n_s+1})}$$

$$\ge \sum_{p=1}^{M} \liminf_{t \to \infty} \frac{E[N_{p,t}(a)]}{\ln(t)} - (M-1) \cdot \limsup_{s \to \infty} \frac{E[N_{C_{n_s}}(a)]}{\ln(C_{n_s})} \cdot \lim_{s \to \infty} \frac{\ln(C_{n_s})}{\ln(C_{n_s+1})}$$

$$\ge \frac{M}{D_{\inf}(\nu_a, \mu^*, \mathcal{P})} - (M-1) \cdot x \cdot \alpha(C) \quad \text{(by (5.1) and (5.2))}.$$

Solving for $x$ concludes the proof. □


5.1 Oblivious Policies Under Dense Communication Sets 


Theorem 7 shows that if we want to achieve the optimal regret, or in other words, if we do not want
the regret to scale with the number of players, then the density of the communication set must be 1.
Note that the linear grid $\{d, 2d, 3d, \ldots\}$ and the exponential grid $\{q, q^2, q^3, \ldots\}$ both have density 1, while
a double-exponential grid $\{q^{(1+\epsilon)}, q^{(1+\epsilon)^2}, q^{(1+\epsilon)^3}, \ldots\}$ has density $1/(1+\epsilon) < 1$. The following
theorem shows the KL-UCB adaptation achieves this lower bound when $\alpha(C) = 1$.

Theorem 8. If the rewards are Bernoulli rewards and the communication set $C$ satisfies $\alpha(C) = 1$,
then for every suboptimal arm $a$,

$$E[N_T(a)] \le \frac{\ln(T)}{D_{\mathrm{KL}}(\mathcal{B}(\mu_a) \| \mathcal{B}(\mu^*))} + o(\ln(T)).$$

Due to the limited space, we defer our proof to Section 5.2, where the DKLUCB policy is introduced.
DKLUCB, as a generalization of the KL-UCB adaptation, is optimal even if $\alpha(C) < 1$.


5.2 Non-Oblivious Policies for Sparse Communication Sets

In Section 5.1 we showed that the KL-UCB adaptation is optimal for dense communication sets
(i.e., $\alpha(C) = 1$). However, if the communication set is very sparse (i.e., $\alpha(C) < 1$), then we cannot
expect an oblivious policy to do uniformly well. As Strategy C in Example 1 has demonstrated,
the main difficulty in designing an algorithm for the distributed MAB problem is that each player is
“isolated” from others during the period between two communication rounds. In the single-player
setting, if one player pulled a suboptimal arm, he should be more certain that this arm is not optimal.
Therefore, he should explore this suboptimal arm less frequently in future rounds. However, when
it comes to the distributed setting, although each player can immediately utilize the information produced by a
suboptimal decision made by himself, other players would not know this experience
until the next communication round.
Based on the observations above, the key idea is to make each player explore less, since the results
of exploration will become common knowledge at the next communication stage. This is implemented
by having each player attempt to predict the number of pulls of each arm made by other
players since the last communication round. These predictions can be wrong, but not that wrong,
as all processes are running the same algorithm. In fact, we can show that the errors in these count
predictions are negligible as $T$ goes to infinity. However, even if these count predictions were fully
correct, we still cannot simulate the full-communication KL-UCB adaptation. This is because the
number of data points available is less than the count prediction. For this reason, we have to use a larger
confidence bound. The following is a non-oblivious distributed policy called DKLUCB (which stands for
distributed KL-UCB), where we replace the count $N_{p,t}(a)$ in the KL-UCB adaptation with the count
prediction $N'_{p,t}(a)$, and replace the original confidence bound with a slightly larger one.

DKLUCB for Bernoulli rewards. Define $\ell(t) := \max\{u \le t : u \in C \vee u = 0\}$,


M(ln(<)-f 31n(ln(<))) Ni^t){a) 

= l + (M-l)a(C) ’ 




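$$N'_{p,t}(a) := N_{p,t}(a) + (M-1) \cdot \min\left(N_{p,t}(a) - N_{\ell(t)}(a),\ M_t(a)\right).$$

Then choose $A_{p,t} = \arg\max_a B^+_{p,t}(a)$,
where

$$B^+_{p,t}(a) = \sup\left\{ q \in (0,1) : D_{\mathrm{KL}}\left(\mathcal{B}(\hat{\mu}_{p,t-1}(a)) \,\|\, \mathcal{B}(q)\right) \le \frac{M(\ln(t) + 3\ln(\ln(t)))}{(1 + (M-1)\alpha(C)) \cdot N'_{p,t-1}(a)} \right\}.$$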


The DKLUCB policy is a non-oblivious policy because player $p$ in round $t$ makes decisions not only
based on $X_{p,t-1}(a)$, but also based on $N_{\ell(t)}(a)$, which is the number of pulls of each arm at the
end of the last communication round. Note that the KL-UCB adaptation can be seen as a special case
of the DKLUCB policy. In fact, when $\alpha(C) = 1$, DKLUCB is identical to the KL-UCB adaptation.
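The bookkeeping behind the count prediction can be sketched in a few lines of Python. This is only an illustration: the cap $M_t(a)$ is passed in as a plain number rather than computed, the enlarged confidence level is an assumed form chosen to be consistent with Claim 12 and Theorem 9, and all function names are hypothetical.

```python
import math
from bisect import bisect_right


def last_comm_round(comm_rounds, t):
    """ell(t): the last communication round u <= t, or 0 if there is none."""
    i = bisect_right(comm_rounds, t)
    return comm_rounds[i - 1] if i else 0


def predicted_count(n_local, n_last_comm, cap, num_players):
    """N'_{p,t}(a) = N_{p,t}(a) + (M - 1) * min(N_{p,t}(a) - N_{ell(t)}(a), M_t(a)).

    The player assumes each other player pulled the arm as often as he did
    since the last communication round, up to the cap (supplied by the caller).
    """
    own_since_comm = n_local - n_last_comm
    return n_local + (num_players - 1) * min(own_since_comm, cap)


def enlarged_level(t, num_players, density):
    """Assumed enlarged level M (ln t + 3 ln ln t) / (1 + (M - 1) alpha(C))."""
    if t < 3:
        raise ValueError("t must be at least 3 so that ln(ln(t)) is positive")
    base = math.log(t) + 3.0 * math.log(math.log(t))
    return num_players * base / (1.0 + (num_players - 1) * density)
```

With `density=1.0` the enlarged level reduces to the single-player level, and with `cap=0` the predicted count equals the local count, which mirrors the remark that DKLUCB coincides with the KL-UCB adaptation in the dense case.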

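Theorem 9. If the rewards are Bernoulli rewards and the communication set is $C$, then for every
suboptimal arm $a$,

$$E[N_T^{\mathrm{DKLUCB}}(a)] \le \frac{M}{1 + (M-1)\alpha(C)} \cdot \frac{\ln(T)}{D_{\mathrm{KL}}(\mathcal{B}(\mu_a) \| \mathcal{B}(\mu^*))} + o(\ln(T)).$$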


This theorem shows that the DKLUCB policy can achieve the lower bound in Theorem 7. This
is consistent with the existing results and intuition. If $M = 1$, then DKLUCB is identical to the
single-player KL-UCB policy, and the upper bound is the same. If $\alpha(C) = 1$, DKLUCB is identical
to the KL-UCB adaptation, and the upper bound is still the same (this is formalized as Theorem 8).
If $\alpha(C) = 0$, DKLUCB is no better than $M$ single-player KL-UCB policies running independently.
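The three special cases can be verified directly from the leading constant of the bound. The snippet below is a small sketch with hypothetical function names; it evaluates $\frac{M}{1+(M-1)\alpha(C)} \cdot \frac{1}{D_{\mathrm{KL}}(\mathcal{B}(\mu_a) \| \mathcal{B}(\mu^*))}$ for Bernoulli arms.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def leading_constant(mu_a, mu_star, num_players, density):
    """Coefficient of ln(T) in the bound for a suboptimal arm with mean mu_a."""
    scaling = num_players / (1.0 + (num_players - 1) * density)
    return scaling / bernoulli_kl(mu_a, mu_star)


single_player = leading_constant(0.8, 0.9, 1, 0.0)
dense = leading_constant(0.8, 0.9, 5, 1.0)    # equals the single-player constant
isolated = leading_constant(0.8, 0.9, 5, 0.0)  # five times the single-player constant
```

Intermediate densities interpolate between these two extremes, so a denser communication set buys a smaller constant.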

As in Section 4, to prove an upper bound, first a lower bound is needed. However, this time we need
a lemma much stronger than Lemma 3, and Theorem 2 in [11] definitely will not help.

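Lemma 10. If the rewards are Bernoulli rewards and the communication set is $C$, then for every
suboptimal arm $a$ and any $\delta > 0$,

$$\lim_{T \to \infty} \Pr\left( \bigcap_{t \ge T} \left\{ N_t^{\mathrm{DKLUCB}}(a) > \frac{M}{1 + (M-1)\alpha(C)} \cdot \frac{(1-\delta)\ln(t)}{D_{\mathrm{KL}}(\mathcal{B}(\mu_a) \| \mathcal{B}(\mu^*))} \right\} \right) = 1.$$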


The idea of the proof is similar to that of Lemma 3, but there are many new ingredients. First we
require the following claim to relate the count prediction $N'_{p,t}(a)$ to the true count $N_t(a)$.

Claim 11. For every $a$, $t$, and constant $c$, if $N_t(a) < c$, then there exists $p$ such that $N'_{p,t}(a) < c$.

Proof sketch of Lemma 10. Let $\xi$ be a shorthand for the leading constant before $\ln(t)$. It suffices to
prove $\lim_{T \to \infty} \Pr\left(\bigcup_{t \ge T} \{N_t(a) < (1-\delta)\xi\ln(t)\}\right) = 0$. For every round $t$, we define a random
player $p_t$ such that $N'_{p_t,t}(a) \le N'_{p',t}(a)$ for every $p' \in [M]$. We say player $p_t$ is chosen for round $t$.
We also define random players $P_t$ such that $P_t$ is the player who is chosen most frequently before
round $t$. Let $S_t$ be a random set including the rounds (before round $t$) in which player $P_t$ is chosen.
Let the random variable $\Gamma_t$ be the arm that is pulled most frequently by $P_t$ in those rounds in $S_t$, and
let $\Psi_t$ be the last round in $S_t$ in which $P_t$ pulls $\Gamma_t$. By Claim 11,

$$\{N_t(a) < (1-\delta)\xi\ln(t)\} \subseteq \left\{N'_{P_t,\Psi_t}(a) < (1-\delta)\xi\ln(t) \wedge B^+_{P_t,\Psi_t}(\Gamma_t) \ge B^+_{P_t,\Psi_t}(a)\right\}.$$

By the definitions of $\Gamma_t$, $P_t$, and $\Psi_t$, we have $\Psi_t \ge (t-1)/(MK)$ and $N_{P_t,\Psi_t}(\Gamma_t) \ge (t-1)/(MK)$.

The rest is close to the proof of Lemma 3, requiring more careful dealing with the infinite union. □

The following claim explains why we need a slightly larger confidence bound in DKLUCB.

Claim 12. For every player $p$, arm $a$, and $t > 0$, $N'_{p,t}(a) \le M / (1 + (M-1)\alpha(C)) \cdot N_{p,t}(a)$.

In the proof of Theorem 2, choosing $\Lambda_{p,T}$ is relatively easy. However, Theorem 9 requires choosing
$\Lambda_{p,T}$ more cleverly, and the overlapping histories cannot be simply ignored as in Theorem 2.

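Proof sketch of Theorem 9. Let $\delta > 0$ be an arbitrarily small number and $\xi$ be a shorthand for the
leading constant before $\ln(T)$. Define $\Upsilon_T$ to be the largest positive integer such that $C_{\Upsilon_T} < T$
and $N_{C_{\Upsilon_T}}(a) < (1-\delta)\xi\ln(T)$. Define random variables $\Lambda_{p,T}$ to be the last round such that
$C_{\Upsilon_T} \le \Lambda_{p,T} < C_{\Upsilon_T+1}$ and $N'_{p,\Lambda_{p,T}}(a) < (1-\delta)\xi\ln(\min(C_{\Upsilon_T+1}, T))$, or $C_{\Upsilon_T}$ if there is no such
round. Using Claim 12, a standard bandit argument would show that for each player $p$ the expected
number of pulls of arm $a$ after $\Lambda_{p,T}$ is no more than $4\delta\xi\ln(T) + o(\ln(T))$, which is negligible. Using
a decomposition similar to (4.2), by Lemma 10, the remaining expected number of pulls is

$$M \cdot E[N_{1,\Lambda_{1,T}}(a)] - (M-1) E[N_{C_{\Upsilon_T}}(a)] = E\left[N_{1,\Lambda_{1,T}}(a) + (M-1)\left(N_{1,\Lambda_{1,T}}(a) - N_{C_{\Upsilon_T}}(a)\right)\right]$$

$$\le E\left[N_{1,\Lambda_{1,T}}(a) + (M-1)\left(N_{1,\Lambda_{1,T}}(a) - N_{C_{\Upsilon_T}}(a)\right) \,\middle|\, N_{C_{\Upsilon_T}}(a) > (1-\delta)\xi\ln(C_{\Upsilon_T})\right] + o(\ln(T)).$$

By the definition of $N'_{p,t}(a)$ and $\Lambda_{p,T}$, we have

$$N'_{1,\Lambda_{1,T}}(a) = N_{1,\Lambda_{1,T}}(a) + (M-1) \cdot \min\left(N_{1,\Lambda_{1,T}}(a) - N_{C_{\Upsilon_T}}(a),\ M_{\Lambda_{1,T}}(a)\right) < (1-\delta)\xi\ln(T);$$

this plus the definition of $\alpha(C)$ will together imply that, given $N_{C_{\Upsilon_T}}(a) > (1-\delta)\xi\ln(C_{\Upsilon_T})$,

$$N_{1,\Lambda_{1,T}}(a) + (M-1)\left(N_{1,\Lambda_{1,T}}(a) - N_{C_{\Upsilon_T}}(a)\right) \le \xi\ln(T) + o(\ln(T)). \qquad \Box$$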

6 Discussion 

Finite-time analysis. We write all the analyses in a way such that they can be translated into finite-time
analyses and produce finite-time results (e.g., we avoid using Theorem 2 of [11] to get Lemma
3). However, our results heavily rely on Lemma 3 and Lemma 10, where the technique we used
works fine to show that the probability converges to 1, yet will produce poor constants if translated
into a finite-time version. Specifically, every time we use the pigeonhole principle (e.g., “the arm that
is pulled most frequently”), we add a constant $K$ or $M$, which we guess is not necessary in the
finite-time bound. Our hypothesis is that there are more elegant ways to prove Lemma 3 and Lemma
10, which will provide tighter constants. We hope to solve this problem in future work.

Difference between bandits and distributed bandits. The typical way to prove an upper bound
for traditional bandits is to show that once the upper bound is reached, later suboptimal decisions are
negligible. However, this does not work for distributed bandits. The main difficulty is that even if the
upper bound is reached globally ($N_t(a)$ has reached the upper bound), it may not be reached locally
($N_{p,t}(a)$ may be far less than the upper bound). That is why we need the density of the communication
set, Lemma 3, and Lemma 10 to make sure the desired upper bound is reached locally.

Beyond UCB and KL-UCB. Most of the results and analyses in this paper are very specific: they
are either restricted to the KL-UCB adaptation (or the UCB adaptation, after some modification
or even simplification, as we have mentioned), or to a generalization of the KL-UCB adaptation
(i.e., DKLUCB). However, results similar to Theorem 2, Theorem 8, and Theorem 9 can be reproduced
for any UCB-like oblivious bandit policies, as long as they behave normally in the sense that
properties similar to Lemma 3 and Lemma 10 hold. These results can also be extended to other
oblivious bandit policies that are not based on confidence bounds, such as Thompson sampling [?].
The intuition behind this is that all the bandit policies behave similarly, and are possibly indistinguishable
by observing the actions they take. This inspires us to develop a universal framework to prove the
performance of bandit policies under distributed settings. However, this framework requires much
more insight and we would like to leave it as future work.
Appendix A Preliminaries

In this section, we review some notation and introduce new notation that will be used in later proofs.
We denote the set of all the positive integers by $\mathbb{N}^+$ and the set of all the positive real numbers by
$\mathbb{R}^+$. We denote the set $\{1, 2, 3, \ldots, K\}$ by $[K]$. We denote by $|A|$ the cardinality of a set $A$. The
mean of distribution $\nu$ is denoted by $E[\nu]$. We use $\wedge$ to represent logical conjunction (AND) and
use $\vee$ to represent logical disjunction (OR). “$\wedge$” has higher precedence than “$\vee$”. Both “$\wedge$” and “$\vee$”
have higher precedence than other connectives such as “$\Rightarrow$” or “$\Leftrightarrow$”.

A.1 Distributed Bandit Process

The communication set is an infinite set that contains the indices of the communication rounds and
is always denoted by $C$. The elements of $C$ are denoted by $C_1, C_2, C_3, \ldots$ in ascending order.
We define the function $\ell : \mathbb{N}^+ \to \mathbb{N}$ by

$$\ell(t) := \max\{u \le t : u \in C \vee u = 0\}.$$

That is, $\ell(t)$ is the last communication round in the first $t$ rounds, and it takes value 0 if there is no
such round. We now define a strict partial order $\prec$ on all the rewards $\mathcal{X} = \{X_{u,v} : u \ge 1, v \ge 1\}$.
Concretely, $X_{u_1,v_1} \prec X_{u_2,v_2}$ if and only if

$$v_1 < \ell(v_2) \vee u_1 = u_2 \wedge v_1 < v_2.$$

That is, $X_{u_1,v_1} \prec X_{u_2,v_2}$ if and only if anyone who has collected reward $X_{u_2,v_2}$ must also have
collected reward $X_{u_1,v_1}$. For each player $p$, we define $\prec_p$ to be a linear extension of $\prec$ such that
$X_{u_1,v_1} \prec_p X_{u_2,v_2}$ if and only if

$$v_1 < \ell(v_2) \vee \ell(v_1) = \ell(v_2) \wedge (u_1 = p \vee u_1 < u_2).$$

Note the subscript $p$ means this linear extension is defined differently for each player. For player $p$,
this linear extension gives an order on the rewards he receives. It can be checked that both $\prec$ and
$\prec_p$ are legitimate definitions. We also define

$$X_{p,t}(a) := \{X_{u,v} : X_{u,v} \prec X_{p,t+1} \wedge A_{u,v} = a\} = \{X_{u,v} : X_{u,v} \prec_p X_{p,t+1} \wedge A_{u,v} = a\}$$

and $N_{p,t}(a) := |X_{p,t}(a)|$. That is, $X_{p,t}(a)$ is the set of rewards from arm $a$ that player $p$ has collected
at the end of round $t$, and $N_{p,t}(a)$ is the cardinality of this set.

For each player $p$ and each arm $a$, we define a sequence of random variables $(X_i^{p,a})_{i \ge 1}$ where $X_i^{p,a}$
is the $i$th element in the set

$$\{X_{u,v} : X_{u,v} \in \mathcal{X} \wedge A_{u,v} = a\}$$

with respect to the order $\prec_p$. We also define

$$\tilde{\mu}_{p,s}(a) := \frac{1}{s} \sum_{i=1}^{s} X_i^{p,a}.$$

By this definition we can see that $\hat{\mu}_{p,t}(a) = \tilde{\mu}_{p,N_{p,t}(a)}(a)$.

Recall that the random variable $\hat{\mu}_{p,t}(a)$ is the empirical mean of those rewards generated by arm
$a$ that are known to player $p$ at the end of round $t$. There are $N_{p,t}(a)$ of these rewards, each of
them obtained either by player $p$ pulling arm $a$ himself, or via communication (i.e., from other
players). Thus we need a lemma to ensure that the additional data obtained from other players are
indistinguishable from the data collected by players themselves. In other words, we shall prove that
$\tilde{\mu}_{p,s}(a)$ is the empirical mean of $s$ mutually independent random variables with the same distribution
$\nu_a$.

Lemma 13. For every player $p$, every arm $a$, and every $s \in \mathbb{N}^+$, $s \cdot \tilde{\mu}_{p,s}(a) \sim B(s, \mu_a)$,
where $B(\cdot, \cdot)$ is a binomial distribution.

Proof. Fix an arm $a$ and a player $p$. We denote by $F(x)$ the cumulative distribution function
corresponding to $\nu_a$. In addition, we define random variables $p_i$ and $t_i$ such that $p_i$ is the player
who first receives the reward $X^{p,a}_i$ and $t_i$ is the round in which this takes place. Clearly we have
$X^{p,a}_i = X_{p_i,t_i}$, and therefore

$$s \cdot \tilde\mu_{p,s}(a) = \sum_{i=1}^{s} X_{p_i,t_i}.$$

Hence, it now suffices to show that the $X_{p_i,t_i}$ are mutually independent random variables with the same
cumulative distribution function $F(x)$.

First we will prove that for every $i$, $X_{p_i,t_i}$ has the cumulative distribution function $F(x)$. That is,
$\Pr(X_{p_i,t_i} \le \lambda) = F(\lambda)$ for any $\lambda \in \mathbb{R}$. Note that

$$\begin{aligned}
\Pr(X_{p_i,t_i} \le \lambda) &= \sum_{u,v} \Pr(X_{u,v} \le \lambda \wedge p_i = u \wedge t_i = v) \\
&= \sum_{u,v:\,\Pr(p_i = u \wedge t_i = v) > 0} \Pr(X_{u,v} \le \lambda \mid p_i = u \wedge t_i = v)\,\Pr(p_i = u \wedge t_i = v) \\
&= \sum_{u,v:\,\Pr(p_i = u \wedge t_i = v) > 0} F(\lambda)\,\Pr(p_i = u \wedge t_i = v) = F(\lambda), \qquad \text{(A.1)}
\end{aligned}$$

where the third equality holds because given $p_i = u$ and $t_i = v$, the reward $X_{u,v}$ is generated by
arm $a$ independently.


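Then we will prove mutual independence by induction. It suffices to show that for every $s$ real
numbers $\lambda_1, \lambda_2, \cdots, \lambda_s$ and $s$ integers $l_1 < l_2 < \cdots < l_s$ we have

$$\Pr\left(\bigwedge_{i=1}^{s} X_{p_{l_i},t_{l_i}} \le \lambda_i\right) = \prod_{i=1}^{s} F(\lambda_i).$$

The base case where $s = 1$ has been proved in (A.1). Suppose we have proved that mutual
independence holds for $s - 1$, and we also define the event $\mathcal{E}_{s,u,v}$ by

$$\mathcal{E}_{s,u,v} := \bigwedge_{i=1}^{s-1} X_{p_{l_i},t_{l_i}} \le \lambda_i \;\wedge\; p_{l_s} = u \;\wedge\; t_{l_s} = v.$$

Then we have

$$\begin{aligned}
\Pr\left(\bigwedge_{i=1}^{s} X_{p_{l_i},t_{l_i}} \le \lambda_i\right)
&= \sum_{u,v} \Pr\left(X_{u,v} \le \lambda_s \wedge \mathcal{E}_{s,u,v}\right) \\
&= \sum_{u,v:\,\Pr(\mathcal{E}_{s,u,v}) > 0} \Pr(X_{u,v} \le \lambda_s \mid \mathcal{E}_{s,u,v})\,\Pr(\mathcal{E}_{s,u,v}) \\
&= \sum_{u,v:\,\Pr(\mathcal{E}_{s,u,v}) > 0} F(\lambda_s)\,\Pr(\mathcal{E}_{s,u,v}) \\
&= F(\lambda_s)\,\Pr\left(\bigwedge_{i=1}^{s-1} X_{p_{l_i},t_{l_i}} \le \lambda_i\right) \\
&= F(\lambda_s) \prod_{i=1}^{s-1} F(\lambda_i) \qquad \text{(by induction hypothesis)} \\
&= \prod_{i=1}^{s} F(\lambda_i),
\end{aligned}$$

where the third equality holds because given the event $\mathcal{E}_{s,u,v}$, the reward $X_{u,v}$ is generated by arm
$a$ independently. □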


Technically, this lemma is required whenever Hoeffding's inequality is used to bound $\tilde\mu_{p,s}(a)$.
For simplicity, later proofs may use this lemma without explicitly pointing it out.


A.2 Kullback-Leibler Divergences 


The Kullback-Leibler divergence (KL-divergence) from probability distribution $\nu_1$ to probability
distribution $\nu_2$ is defined by

$$D_{\mathrm{KL}}(\nu_1 \| \nu_2) := \int \ln\left(\frac{d\nu_1}{d\nu_2}\right) d\nu_1.$$



Accordingly, we define

$$D_{\inf}(\nu, a, \mathcal{P}) := \inf_{\nu' \in \mathcal{P}:\ \mathbb{E}[\nu'] > a} D_{\mathrm{KL}}(\nu \| \nu').$$

Clearly, for two Bernoulli distributions $\nu_1$ and $\nu_2$ satisfying $\mathbb{E}[\nu_1] < \mathbb{E}[\nu_2]$, we have

$$D_{\mathrm{KL}}(\nu_1 \| \nu_2) = D_{\inf}(\nu_1, \mathbb{E}[\nu_2], \mathcal{B}),$$

where $\mathcal{B}$ is the set of all the Bernoulli distributions. Note that the parameter of a Bernoulli
distribution usually takes value in the open interval $(0,1)$. However, the empirical mean of Bernoulli trials
can take value in the closed interval $[0,1]$. Hence we define the extended Bernoulli distribution with
parameter $p \in [0,1]$ to be a distribution having probability mass $p$ on 1 and $1 - p$ on 0. We let

$$\mathcal{K}(p, q) := p \ln\frac{p}{q} + (1 - p) \ln\frac{1-p}{1-q}$$

be the KL-divergence from an extended Bernoulli distribution with parameter $p$ to another with
parameter $q$, with conventions $0 \cdot \ln(0) = 0$ and $\ln(0/0) = 0$. We also define the left-side truncated
KL-divergence $\mathcal{K}'(p, q)$ as 0 if $p > q$, or $\mathcal{K}(p, q)$ otherwise.
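The two quantities just defined are easy to evaluate numerically. The following is a minimal sketch, not code from the paper: the function names `bernoulli_kl` and `truncated_kl` are hypothetical, and returning infinity when $q \in \{0, 1\}$ puts no mass where $p$ does is an assumption (the usual convention), since the text only fixes the conventions $0 \cdot \ln(0) = 0$ and $\ln(0/0) = 0$.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL-divergence K(p, q) between extended Bernoulli distributions.

    Uses the conventions 0 * ln(0) = 0 and ln(0/0) = 0 from the text.
    Returning +inf when q assigns zero mass to an outcome that p can
    produce is an assumption (the standard convention).
    """
    def term(x: float, y: float) -> float:
        if x == 0.0:
            return 0.0       # covers 0 * ln(0 / y) and 0 * ln(0 / 0)
        if y == 0.0:
            return math.inf  # assumed: p has mass where q has none
        return x * math.log(x / y)

    return term(p, q) + term(1.0 - p, 1.0 - q)


def truncated_kl(p: float, q: float) -> float:
    """Left-side truncated divergence K'(p, q): 0 if p > q, else K(p, q)."""
    return 0.0 if p > q else bernoulli_kl(p, q)
```

For instance, `bernoulli_kl(0.0, 0.5)` evaluates the boundary case in which the empirical mean is exactly 0, which is the situation the extended Bernoulli distribution is introduced to handle.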
A.3 Tools From Single-Player Bandits

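The original proof of the KL-UCB algorithm mainly relies on the following self-normalized deviation
bound, which we cannot avoid either.

Lemma 14 (Ht]). Let $\hat\mu_s$ be the empirical mean of $s$ mutually independent Bernoulli random variables
with the same mean $\mu$, then for every $\delta > 0$ and every positive integer $n$,

$$\Pr\left(\bigcup_{s=1}^{n} \left\{\hat\mu_s < \mu \wedge s\,\mathcal{K}(\hat\mu_s, \mu) > \delta\right\}\right) \le e \lceil \delta \ln(n) \rceil e^{-\delta}.$$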



Appendix B Proof of Theorem 2

Let (5 > 0 be an arbitrarily small number, be a shorthand for I/Ukl {^{lJ'a)\\l3{p,*)), and Tq be 
a shorthand for [Tm]. We define random variables 'Pt in the following way: if NTo{a) > 
(1 — S)^ ln(T), then = 0 and T't = Tg; otherwise = Tq and T't = T. We also define 
random variables Ap.r = max{f : t < 'I't, Ap t(a) < (1 — i5)^ In(T'T)}- It can be checked 

that Ap T is well-defined. 

Step 1: Bound the count after $\Lambda_{p,T}$ (traditional bandit argument). We first do an event
decomposition:

$$\{A_{p,t} = a\} \subseteq \{B^+_{p,t}(a^*) < \mu^*\} \cup \{B^+_{p,t}(a) \ge \mu^* \wedge A_{p,t} = a\}, \quad \text{for } t \text{ large enough}.$$

Then we will show that the event $\{B^+_{p,t}(a^*) < \mu^*\}$ can be safely ignored.

Lemma 15. For every player $p$ and every arm $a$,

$$\sum_{t=1}^{T} \Pr\left(B^+_{p,t}(a) < \mu_a\right) = o(\ln(T)).$$


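Proof. Note that

$$\begin{aligned}
\Pr\left(B^+_{p,t}(a) < \mu_a\right)
&\le \Pr\left(N_{p,t}(a)\,\mathcal{K}'(\tilde\mu_{p,N_{p,t}(a)}(a), \mu_a) > \ln(t) + 3\ln(\ln(t))\right) \\
&\le \Pr\left(\bigcup_{s=1}^{Mt} \left\{\tilde\mu_{p,s}(a) < \mu_a \wedge s\,\mathcal{K}(\tilde\mu_{p,s}(a), \mu_a) > \ln(t) + 3\ln(\ln(t))\right\}\right) \\
&\le e \left\lceil (\ln(t) + 3\ln(\ln(t))) \ln(Mt) \right\rceil e^{-(\ln(t) + 3\ln(\ln(t)))}, \qquad \text{(B.1)}
\end{aligned}$$

where the last inequality follows from Lemma 14. Summing (B.1) from 1 to $T$ yields $o(\ln(T))$. □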
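The claim that the summed bound is $o(\ln(T))$ can be checked numerically. The sketch below assumes the right-hand side of (B.1) has the KL-UCB form $e \lceil (\ln t + 3 \ln\ln t) \ln(Mt) \rceil e^{-(\ln t + 3\ln\ln t)}$; the function names are hypothetical, and the sum starts at $t = 3$ so that $\ln\ln t$ is positive (the first two rounds contribute only a constant).

```python
import math


def b1_bound(t: int, num_players: int) -> float:
    """Assumed right-hand side of (B.1) for a round t >= 3."""
    threshold = math.log(t) + 3.0 * math.log(math.log(t))
    ceiling = math.ceil(threshold * math.log(num_players * t))
    return math.e * ceiling * math.exp(-threshold)


def summed_bound_over_log(horizon: int, num_players: int) -> float:
    """Sum of the per-round bounds up to `horizon`, divided by ln(horizon)."""
    total = sum(b1_bound(t, num_players) for t in range(3, horizon + 1))
    return total / math.log(horizon)
```

If the sum is indeed $o(\ln(T))$, the value returned by `summed_bound_over_log` should shrink as the horizon grows. The decay is slow, because the sum itself grows roughly like $\ln\ln(T)$.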


Hence we can ignore the event $\{B^+_{p,t}(a^*) < \mu^*\}$ and only bound the probability of the event
$\{B^+_{p,t}(a) \ge \mu^* \wedge A_{p,t} = a\}$.

T 


E 


.t=AT-l-l 
T 


< E 


< E 


- A „_i _1 *' 


i=AT+l 






AA 


p,t — 


-E 


^ ^ l{A''p,t(a)<(l+i5)5 ln(t)AAp,t=a} 

.t=Ar-l-l 


r(ln(T)) 


,s=0 '■ f 


45^ ln(T) -(- o(ln(T)) (by the definition of Ap^y) 


< ^ Pr (/tp^s(a) > p-\- e) 45^ ln(T) -f o(ln(T)) (for some e > 0) 


s=0 


< e ® -f 45^ ln(T) -f o(ln(T)) (by Hoeffding’s inequality) 

s=0 

= 45ein(r) + o(ln(r)), 
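Therefore, for all the players, the count after $\Lambda_{p,T}$ is no more than $4M\delta\xi \ln(T) + o(\ln(T))$.

Step 2: Bound the count before $\Lambda_{p,T}$. The total count of all players before $\Lambda_{p,T}$ should be no more
than

$$\begin{aligned}
&\sum_{p=1}^{M} \mathbb{E}\left[N_{p,\Lambda_{p,T}}(a)\right] - (M-1)\,\mathbb{E}\left[N_{\Phi_T}(a)\right] \\
&\quad\le M \cdot \mathbb{E}\left[N_{1,\Lambda_{1,T}}(a)\right] \\
&\quad\le M \cdot \left(\Pr(\Phi_T = 0)(1-\delta)\xi \ln(T_0) + \Pr(\Phi_T = T_0)(1-\delta)\xi \ln(T)\right) \\
&\quad\le (1-\delta)\xi M \ln\left(\lceil T^{1/M} \rceil\right) + o(1) \cdot M(1-\delta)\xi \ln(T) \qquad \text{(by Lemma 3)} \\
&\quad= \xi(1-\delta)\ln(T) + o(\ln(T)).
\end{aligned}$$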

Step 3: Put everything together. Adding up all the components, we get

$$\mathbb{E}\left[N_T(a)\right] \le \xi\,(1 + (4M-1)\delta) \ln(T) + o(\ln(T)).$$

This concludes the proof.
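As a quick check of the arithmetic in Step 3, assume the Step 1 contribution is $4M\delta\xi\ln(T)$ and the Step 2 contribution is $\xi(1-\delta)\ln(T)$, each up to an $o(\ln(T))$ term, where $\xi$ denotes the leading constant and $\delta$ the slack parameter. The two leading terms then combine as follows:

```latex
\begin{align*}
4M\delta\,\xi\ln(T) + \xi(1-\delta)\ln(T)
  &= \xi\,(4M\delta + 1 - \delta)\ln(T) \\
  &= \xi\,(1 + (4M-1)\delta)\ln(T).
\end{align*}
```

Since $\delta$ can be taken arbitrarily small, the leading constant can be brought arbitrarily close to $\xi$.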


Appendix C Proof of Lemma 3

Let $\xi$ be a shorthand for $1 / D_{\mathrm{KL}}(\mathcal{B}(\mu_a) \| \mathcal{B}(\mu^*))$. Then it suffices to prove that in a single-player
setting, $\lim_{T\to\infty} \Pr\left(N_T(a) > (1-\delta)\xi \ln(T)\right) = 1$, and then use a union bound.

We can see this as a weaker form of a special case of Lemma 10. It is weaker because it does not
contain an infinite intersection like Lemma 10. It is a special case because we can let $M = 1$ in
Lemma 10. For these reasons, the proof should be a simplified version of the proof of Lemma 10.
To avoid duplication, we refer the reader to the proof in Section G.


Appendix D Proof of Proposition 6


Since Proposition 6(a) is a direct consequence of Proposition 6(b) and (c), it suffices to prove the
latter two statements. Let

C = {ln(Ci),ln(C2),ln(C'3),---} 

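and we denote $\ln(C_k)$ by $c_k$. Assume $\alpha(\mathcal{C}) = d \in (0, 1]$. Then for every $\epsilon \in (0, d)$, there exists an
$N_\epsilon$ large enough such that $c_k / c_{k+1} > d - \epsilon$ for every $k \ge N_\epsilon$. Thus, for $n$ large enough,

$$\ln(n) < c_{Z_{\mathcal{C}}(n)+1} < \frac{c_{N_\epsilon}}{(d-\epsilon)^{Z_{\mathcal{C}}(n)+1-N_\epsilon}}.$$

It then follows from the inequality that

$$Z_{\mathcal{C}}(n) > \frac{\ln(\ln(n)) - \ln(c_{N_\epsilon})}{\ln\left(1/(d-\epsilon)\right)} + N_\epsilon - 1.$$

Note that $\epsilon$ can be chosen to be arbitrarily small. If $d = 1$, then $Z_{\mathcal{C}}(n) \in \omega(\ln\ln(n))$. Otherwise,
we have

$$\liminf_{n\to\infty} \frac{Z_{\mathcal{C}}(n)}{\ln(\ln(n)) / \ln(d^{-1})} \ge 1.$$

This completes the proof of Proposition 6(b) and (c).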
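A concrete communication set may make the statement easier to picture. The sketch below uses the doubly exponential schedule $C_k = 2^{2^k}$, chosen purely for illustration, for which $\ln(C_k)/\ln(C_{k+1}) = 1/2$ for every $k$. It assumes that $\alpha(\mathcal{C})$ is the limit inferior of these ratios and that $Z_{\mathcal{C}}(n)$ counts the communication rounds no later than $n$; all function names are hypothetical.

```python
import math


def comm_rounds(k_max: int) -> list[int]:
    """Illustrative schedule C_k = 2**(2**k) for k = 1..k_max."""
    return [2 ** (2 ** k) for k in range(1, k_max + 1)]


def log_ratios(rounds: list[int]) -> list[float]:
    """Ratios ln(C_k) / ln(C_{k+1}); their liminf is the assumed density."""
    return [math.log(c) / math.log(c_next)
            for c, c_next in zip(rounds, rounds[1:])]


def count_rounds_up_to(rounds: list[int], n: int) -> int:
    """Assumed meaning of Z_C(n): communication rounds no later than n."""
    return sum(1 for c in rounds if c <= n)


def normalized_count(rounds: list[int], n: int, d: float) -> float:
    """Z_C(n) divided by ln(ln(n)) / ln(1/d), the quantity in part (c)."""
    return count_rounds_up_to(rounds, n) / (math.log(math.log(n)) / math.log(1.0 / d))
```

With $d = 1/2$ and $n = C_j$, the normalized count equals $j / (j + \log_2 \ln 2)$, which stays above 1 and approaches 1 as $j$ grows, in line with the limit-inferior bound.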


Appendix E Proof of Theorem 0 

For the single-player MAB model, fTTI] gave a lower bound for single-parametric distributions. 0] 
generalized this result to non-parametric models. Translated into our model, their results can be 
summarized as the following theorem. 
Theorem 16 (|01). If $M = 1$ and $\pi$ is a consistent policy, then for every suboptimal arm $a$ satisfying
$0 < D_{\inf}(\nu_a, \mu^*, \mathcal{P}) < \infty$,

$$\liminf_{T\to\infty} \frac{\mathbb{E}\left[N_T(a)\right]}{\ln(T)} \ge \frac{1}{D_{\inf}(\nu_a, \mu^*, \mathcal{P})}.$$

Now our goal is to establish a similar lower bound for the distributed case (i.e., $M > 1$). Fortunately,
we can use Theorem 16 as a stepping stone. More specifically, we can use a simulator
to simulate all the actions of the $M$ players and apply Theorem 16 to this single simulator. In other
words, we treat every multi-player MAB process as a single-player MAB process whose outcome
is indistinguishable from the original one. This conversion does not provide a direct solution to the
lower bound for the distributed MAB problem. However, we can use it to establish a lower bound
on $N_{p,t}(a)$ for every suboptimal arm $a$.

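Lemma 17. If $\pi$ is a consistent policy, then for every suboptimal arm $a$ satisfying
$0 < D_{\inf}(\nu_a, \mu^*, \mathcal{P}) < \infty$,

$$\liminf_{t\to\infty} \frac{\mathbb{E}\left[N_{p,t}(a)\right]}{\ln(t)} \ge \frac{1}{D_{\inf}(\nu_a, \mu^*, \mathcal{P})}.$$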

Proof Fix a player p and a suboptimal arm a. Let V be the original distributed process. The 
rewards in the set X can be viewed as ones generated by a single-player MAB process S in order 
^p using a single-player policy tt'. Now we use coupling to associate process S to process V and 
use superscripts to distinguish random variables in the two processes. Note that 

Nf{a) < Ap® (a) < Nf{a). (E.l) 

Note that the subscript t in Nf (a) is referring to the time point t in the serialized process, while the 
subscript t in N^^ia) and A® (a) are referring to the time point t in the distributed process. 

Hence, if $\pi$ is a consistent policy (for the distributed MAB), then $\pi'$ is a consistent policy (for the
single-player MAB). Thus,

> _i_.. 

t->oo ln(f) t^-oo ln(f) Di-n{{Va, p*,1^) 

where the first inequality follows from (E.1) and the second follows from Theorem 16. □


Having a lower bound on $N_{p,t}(a)$ is almost equivalent to having a lower bound on $N_t(a)$. In fact,
we have for every $t \ge 1$,

M M 

Nt{a) > ^ (Ap,t(a) - A,(,)(a)) = ^ Ap,t(a) - (M - 1) • A,(,)(a). (E.2) 

p—1 p—1 

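Using (E.2) we can finish the proof of the theorem. Since $\alpha(\mathcal{C}) = \liminf_{k\to\infty} \ln(C_k)/\ln(C_{k+1})$, there exists
a subsequence of $(C_k)_{k\ge1}$, denoted by $(C_{n_s})_{s\ge1}$, such that

$$\lim_{s\to\infty} \frac{\ln(C_{n_s})}{\ln(C_{n_s+1})} = \alpha(\mathcal{C}).$$

Let $x = \limsup_{T\to\infty} \mathbb{E}[N_T(a)] / \ln(T) \ge 0$. If $x = \infty$, the desired inequality holds trivially. Assume $x$ is
finite. Then

$$\begin{aligned}
x &\ge \limsup_{s\to\infty} \frac{\mathbb{E}\left[N_{C_{n_s+1}}(a)\right]}{\ln(C_{n_s+1})} \\
&\ge \liminf_{s\to\infty} \frac{\mathbb{E}\left[\sum_{p=1}^{M} N_{p,C_{n_s+1}}(a) - (M-1) \cdot N_{C_{n_s}}(a)\right]}{\ln(C_{n_s+1})} \\
&\ge \sum_{p=1}^{M} \liminf_{s\to\infty} \frac{\mathbb{E}\left[N_{p,C_{n_s+1}}(a)\right]}{\ln(C_{n_s+1})}
 - (M-1) \cdot \limsup_{s\to\infty} \frac{\mathbb{E}\left[N_{C_{n_s}}(a)\right]}{\ln(C_{n_s})} \cdot \frac{\ln(C_{n_s})}{\ln(C_{n_s+1})} \\
&\ge \frac{M}{D_{\inf}(\nu_a, \mu^*, \mathcal{P})} - (M-1) \cdot x \cdot \alpha(\mathcal{C}) \qquad \text{(by Lemma 17)}.
\end{aligned}$$

Solving for $x$ concludes the proof.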
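The final step of solving for $x$ can be spelled out. The derivation below is a sketch that assumes the last inequality of the chain reads $x \ge M / D_{\inf}(\nu_a, \mu^*, \mathcal{P}) - (M-1) \cdot x \cdot \alpha(\mathcal{C})$ with $x$ finite:

```latex
\begin{align*}
x + (M-1)\,\alpha(\mathcal{C})\,x &\ge \frac{M}{D_{\inf}(\nu_a, \mu^*, \mathcal{P})} \\
x\,\bigl(1 + (M-1)\,\alpha(\mathcal{C})\bigr) &\ge \frac{M}{D_{\inf}(\nu_a, \mu^*, \mathcal{P})} \\
x &\ge \frac{M}{\bigl(1 + (M-1)\,\alpha(\mathcal{C})\bigr)\, D_{\inf}(\nu_a, \mu^*, \mathcal{P})}.
\end{align*}
```

Under this reading, $\alpha(\mathcal{C}) = 1$ gives the constant $1 / D_{\inf}(\nu_a, \mu^*, \mathcal{P})$, while $\alpha(\mathcal{C}) = 0$ gives $M / D_{\inf}(\nu_a, \mu^*, \mathcal{P})$. The factor $M / (1 + (M-1)\alpha(\mathcal{C}))$ is the same one that appears at the end of the proof of Claim 12.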
Appendix F Proof of Theorem 8 and Theorem 9

Theorem 8 is a special case of Theorem 9. If $\alpha(\mathcal{C}) = 1$, then the DKLUCB algorithm reduces to
a simple adaptation of the KL-UCB, and the upper bound is identical to the single-player upper
bound. Therefore, it suffices to prove Theorem 9.

Let (5 > 0 be an arbitrarily small number and ^ be a shorthand for the leading constant before 
ln(T). Define to be the largest positive integer such that < T and Nc-^ (a) < (1 “ 
(5)^ ln(T). Define random variables Ap x to be the last round such that Cx^ < Ap x < C'xt-i-i 
A p ^ ln(min(C'xr-i-i) ’^))- If there is no such round, let Ap_x = Ctt- 

Step 1: bound the count after $\Lambda_{p,T}$ (traditional bandit argument).

We first do an event decomposition:

$$\{A_{p,t} = a\} \subseteq \{B^+_{p,t}(a^*) < \mu^*\} \cup \{B^+_{p,t}(a) \ge \mu^* \wedge A_{p,t} = a\}, \quad \text{for } t \text{ large enough}.$$

Then we will show that the event $\{B^+_{p,t}(a^*) < \mu^*\}$ can be safely ignored.

Lemma 18. For every player $p$ and every arm $a$,

$$\sum_{t=1}^{T} \Pr\left(B^+_{p,t}(a) < \mu_a\right) = o(\ln(T)).$$


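Proof. Note that

$$\begin{aligned}
\Pr\left(B^+_{p,t}(a) < \mu_a\right)
&\le \Pr\left(N'_{p,t}(a)\,\mathcal{K}'(\tilde\mu_{p,N_{p,t}(a)}(a), \mu_a) > c\,(\ln(t) + 3\ln(\ln(t)))\right) \\
&\le \Pr\left(N_{p,t}(a)\,\mathcal{K}'(\tilde\mu_{p,N_{p,t}(a)}(a), \mu_a) > \ln(t) + 3\ln(\ln(t))\right) \qquad \text{(by Claim 12)} \\
&\le \Pr\left(\bigcup_{s=1}^{Mt} \left\{\tilde\mu_{p,s}(a) < \mu_a \wedge s\,\mathcal{K}(\tilde\mu_{p,s}(a), \mu_a) > \ln(t) + 3\ln(\ln(t))\right\}\right) \\
&\le e \left\lceil (\ln(t) + 3\ln(\ln(t))) \ln(Mt) \right\rceil e^{-(\ln(t) + 3\ln(\ln(t)))}, \qquad \text{(F.1)}
\end{aligned}$$

where the last inequality follows from Lemma 14. Summing (F.1) from 1 to $T$ yields $o(\ln(T))$. □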


Hence we can ignore the event $\{B^+_{p,t}(a^*) < \mu^*\}$ and only bound the probability of the event
$\{B^+_{p,t}(a) \ge \mu^* \wedge A_{p,t} = a\}$.


E 


T 

E ^ 

,t=AT-l-l 


{B+j(a)>/i*AAp,t=a} 



r ^ 



T 

■ 

< E 


— AAp,t=a| 

+ E 

E 

^{'^P,t(“)<(l+'5)C ln(t)AAp,t=a} 


Lt=AT + l ^ 



Lt=AT-l-l 


< E 

.s=0 . 

+ AS^\n{T) + o 

(ln(T)) 

(by the definition of Ap x) 


oo 

< ^ Pr (/tp^s(a) > p + e) + 45^ ln(T) + o(ln(T)) (for some e > 0) 

s=0 


< e ® + 45^ ln(T) + o(ln(T)) (by Hoeffding’s inequality) 

s=0 

= 45ein(r) + o(ln(r)), 

Therefore, for all the players, the count after $\Lambda_{p,T}$ is no more than $4M\delta\xi \ln(T) + o(\ln(T))$.


+ o(ln(T)) 
Step 2: bound the count before $\Lambda_{p,T}$. The total count of all players before $\Lambda_{p,T}$ should be no more
than


M 


[7Vp,A^,,(a)] - (M - 1)E [Ncrja) 

p^l 

= M-E [Afi,A,_^(a)] - (M - 1)E fA^Cx^(a) 


= E 


^i,Ai,T(a) + (Af - l)(A^i,Ai,T(a) “ Ncr^{a)) 


<Pr (iVcx, > (l-5)Cln(CTx)) • 

E iVi,Ai,T(a) + {M - l)(7Vi_Ai,T(a) “ Nc-^^{a)) \ Ncr^ > (1 “ ^)Cln(C'TT) 
+ Pr [Ncr^ < (1 - ^)Cln(C'TT)) • 

A^uAi,t(«) + (Af - l)(iVi,Ai,T(a) - NcrAa)) \ Nc-,^ < (1 - ^)Cln(C'Tx) 


< E 


E 

iVi,Ai,x(a) + (M - l)(iVi.A,,x(a) - Nc^M) I ^ (1 " ^)Cln(CTx) 
+ o(ln(T)) (by Lemma fTOll 


(F.2) 


Our goal is to prove (IF.21 l is no more than ^ ln(T)+o(ln(r)). It can be done by showing A^i,Ai t (®) + 
(M - l)(Afi,Ai,x(a) - AfcTx(a)) <'?ln(T) + o(ln(T)) given iVcx^ > (1 - ^)^ln(C'Tx)- Now 
we suppose > (1 “ In(C'Tx) holds. If iVi^Ai t (®) “ ^ '“Ai t (®)’ '^hen it is trivial 

since in that case iVi^Ai,T(«) + (Af — l)(iVi_Ai,T («) “ -Nctj,(®)) = Ai already 

know N[ ^(a) < (1 — (5)^ln(T). Otherwise we have 


A^i,Ai,t(®) + (Af — 1) 


M 


a{C) 


- 1 < (1 - ^)^ln (min(T, Cxx+i)) • 


Using the condition > (1 “ 5)^ In(CTT) '''® 

iVi.A,.x(a) < (1 - <5)e (^ln(min(r,CTx+i)) - ■ 

By the definition of a(C) we have 

7Vi,A,,x(a) < (1 - <5)e {(1 + MCtx)) + o(ln(r)) 


M 


M 




M M 

Therefore, given Ncr > (1 “ 5)^ ln(C'Tx)j 


(F.3) 


A^i,Ai.x(a) + (Af - l)(A^i.Ai,T(a) - Ncr^{a)) 
= M ■ iVi.Axx (a) - (AT - 1) • Ncr^ (a) 

< M ■ iVi.Ai,x(a) - (M - 1)(1 - In(CTx) 

< e ln(r) + o(ln(T)) (by (IF3 T i) 


Hence, 

E[iVxAi,x(a) + (M-l)(7Vi,Ai,x(a)-AfCxx(a)) lA^Cxx ^ ^ W + o(ln(r)) 

Step 3: put everything together. Adding all components up, we get

$$\mathbb{E}\left[N_T(a)\right] \le \xi\,(1 + 4M\delta) \ln(T) + o(\ln(T)).$$

This concludes the proof.
Appendix G Proof of Lemma 10

In this section, in order to make the proof easier and more readable, we will prove a lemma equivalent
to Lemma 10. We first introduce some new concepts.

Definition 19. We say a sequence of random variables $(X_n)_{n\ge1}$

(a) converges to a constant $c$ in probability, denoted by $X_n \xrightarrow{P} c$, if for every $\epsilon > 0$,

$$\lim_{n\to\infty} \Pr\left(|X_n - c| > \epsilon\right) = 0;$$

(b) tends to infinity in probability, denoted by $X_n \xrightarrow{P} \infty$, if for every $N$,

$$\lim_{n\to\infty} \Pr\left(X_n < N\right) = 0.$$


The following is the equivalent lemma we will prove. 

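Lemma 20 (equivalent to Lemma 10). Let $(\Phi_n)_{n\ge1}$ be a sequence of random variables such that
$\Phi_n \xrightarrow{P} \infty$. Then for every suboptimal arm $a$ and every $\delta > 0$,

$$\lim_{n\to\infty} \Pr\left(N_{\Phi_n}(a) \ge \frac{(1-\delta)\ln(\Phi_n)}{\mathcal{K}(\mu_a, \mu_{a^*})}\right) = 1.$$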


Since this is a classical technique for dealing with an infinite union (or intersection), we omit the proof
of equivalence here.

To simplify our proof, we first present three utility lemmas. 

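Lemma 21. Let $(\Phi_n)_{n\ge1}$ be a sequence of random variables such that $\Phi_n \xrightarrow{P} \infty$. Then for every
player $p$ and every arm $a$,

$$\tilde\mu_{p,\Phi_n}(a) \xrightarrow{P} \mu_a.$$

Proof. By the definition of convergence in probability, it suffices to show that for every $\delta > 0$ and
every $\epsilon > 0$ we can find an $N_\epsilon$ such that for any $n > N_\epsilon$,

$$\Pr\left(|\tilde\mu_{p,\Phi_n}(a) - \mu_a| > \delta\right) < \epsilon.$$

Fix $\delta > 0$ and $\epsilon > 0$. We can choose $N_0$ large enough such that

$$\Pr\left(\bigcup_{s=N_0}^{\infty} \left\{|\tilde\mu_{p,s}(a) - \mu_a| > \delta\right\}\right)
\le \sum_{s=N_0}^{\infty} \Pr\left(|\tilde\mu_{p,s}(a) - \mu_a| > \delta\right)
\le \sum_{s=N_0}^{\infty} 2e^{-2\delta^2 s} < \frac{\epsilon}{2},$$

where the first inequality follows from the union bound and the second follows from Hoeffding's
inequality. Then by the definition of tending to infinity in probability, we can choose $N_1$ large enough
such that

$$\Pr\left(\Phi_n < N_0\right) < \frac{\epsilon}{2}, \quad \text{for every } n > N_1.$$

Thus, for every $n > N_\epsilon = \max(N_0, N_1)$,

$$\Pr\left(|\tilde\mu_{p,\Phi_n}(a) - \mu_a| > \delta\right) \le \Pr\left(\Phi_n < N_0\right) + \Pr\left(\bigcup_{s=N_0}^{\infty} \left\{|\tilde\mu_{p,s}(a) - \mu_a| > \delta\right\}\right) < \epsilon,$$

which concludes the proof. □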
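The choice of $N_0$ in the proof of Lemma 21 can be made explicit. The sketch below assumes the two-sided Hoeffding bound $2e^{-2\delta^2 s}$ for rewards in $[0,1]$, so that the tail sum is a geometric series with ratio $r = e^{-2\delta^2}$ and equals $2 r^{N_0} / (1 - r)$. The function names are hypothetical.

```python
import math


def hoeffding_tail(n0: int, delta: float) -> float:
    """Closed form of sum_{s >= n0} 2 * exp(-2 * delta**2 * s)."""
    ratio = math.exp(-2.0 * delta ** 2)
    return 2.0 * ratio ** n0 / (1.0 - ratio)


def smallest_n0(delta: float, eps: float) -> int:
    """Smallest N_0 >= 0 whose tail sum is strictly below eps / 2.

    Solves 2 * r**n0 / (1 - r) < eps / 2 for n0, with r = exp(-2 * delta**2).
    """
    ratio = math.exp(-2.0 * delta ** 2)
    bound = math.log(4.0 / (eps * (1.0 - ratio))) / (2.0 * delta ** 2)
    return max(0, math.floor(bound) + 1)
```

For example, `smallest_n0(0.1, 0.05)` returns the first index from which the summed deviation probabilities fall below $\epsilon / 2 = 0.025$. Smaller values of $\delta$ push this index up roughly like $1/\delta^2$.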


Lemma 22. Let (T„)„>i be a sequence of random arms, (rn)n>i be a sequence of random players, 
and ($n)n>i be a sequence of random variables such that $„ A oo. Then 

3^ >0, Wo >0,Vn>iVo,iVr„,$„(T„) >($„)« - /rT„) A 0. (G.l) 
Proof. Assume the left-hand side of (G.1) holds. By the definition of convergence in probability, it
suffices to show that for every $\delta > 0$ we have

lim Pr (\B^ ^ (T„) - pT„| > = 0. 

n—¥00 \ n-, n J 

Fix a 0 < 5 < 1 — maxa^A Pa, note that 

Pr -PrJ>S^ < Pr (/lr„,Arr„,4^(T„)('Pn) < pr„ - S) 

+ Pr ^Ar„,Arr„,4.„(T„)(Tr„) > PT„ + 2 ) FT„ + <5) < ^ ^ ^ 


M 


< X! (Ap.Arr„,#„(T„)(a) < Pa-S) 

a^Ap—1 


M 


Pr ( Fp,tVr„,4,„(T„)(«) > Pa + 

a^Ap=l 


+ Pr (^ + «) < ^ 


(G.2) 

(G.3) 

(G.4) 


Recall that C^n) > i^n)^ for n large enough, and A 00 . As a consequence, 

iVr„,$„(T„) A c» as n ^ c». 

Then by Lemma 1271 both (IG.2b and (IG.31 I converge to 0 as n —^ c». For (IG.41 l. on the one hand, 
minagyi )C{fj,a + |, /Ta + i5) is a constant; on the other hand, C^n) > (fn) > i^n)^ 

for sufficiently large n and ($„)^ G w therefore 

—>■ 0 as n —>• 00 . 

C^n) 

Hence (IG.41 l is always 0 for n large enough. This concludes the proof. 


□ 


Lemma 23. Let $(\Phi_n)_{n\ge1}$ be a sequence of random variables such that $\Phi_n \xrightarrow{P} \infty$. Then for every
player $p$ and every arm $a$,

$$\hat\mu_{p,\Phi_n}(a) \xrightarrow{P} \mu_a.$$

Proof. First note that $\hat\mu_{p,\Phi_n}(a) = \tilde\mu_{p,N_{p,\Phi_n}(a)}(a)$. By Lemma 21 it suffices to prove that
$N_{p,\Phi_n}(a) \xrightarrow{P} \infty$. By the definition of convergence in probability, we only need to show that for
every $N > 0$ and every $\epsilon > 0$, we can find an $N_\epsilon$ such that for any $n > N_\epsilon$,

$$\Pr\left(N_{p,\Phi_n}(a) < N\right) < \epsilon.$$

For each n > 1, let be the arm that player p has pulled most frequently by round and let T'n 
be the last round in the first $„ rounds that p pulled T„. The definition of T'ji implies that 

■^n>^n/ K. 

Therefore given <!)„ A c» we have T'n A cxd. In addition, we have 

K>^^/ K. 

Hence by Lemma |2^ 

lim Pr fijA (T„) >/rT„ + e) = 0, for any e > 0. (G.5) 

t^OO \ Ti J 

Now let i5 = (1 — maxa^A Pa) / 2. Then for every N > Owe have 
Pr ((a) < A^) = Pr (Afp,$„ (a) < Af A (T„) > B+^^ (a)) 


< Pr 

< Pr 


(T„) > 1 - < 5 ) + Pr (A(/ip,^.„(a), 1 - 5) > 

((Pn) > MT„ + < 5 ) + Pr (^iV • M • /C(0,1 - 5) > .F ) . 
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For every e > 0, by (IG.51 l we can choose Nq large enough such that for every n > Nq, 
(T„) > /rxn + S) < € / 2. Since N ■ M ■ /C(0,1 — J) is a constant, we can also choose 
Ni large enough such that for every n > Ni, N ■ M ■ /C(0,1 — <5) < T{n j K). Finally, according 
to the assumption that <!)„ A oo, we can choose N 2 large enough such that for every n > N 2 , 
Pr($„ < N 2 ) < e/2. Thus, for every n > = max(A^O)-^ij-^ 2 ), Pr (a) < N) < e. □ 


Then we can start our proof of Lemma 20. In fact, it is equivalent to proving the following equation:

$$\lim_{n\to\infty} \Pr\left(N_{\Phi_n}(a) < \frac{(1-\delta)\ln(\Phi_n)}{\mathcal{K}(\mu_a, \mu_{a^*})}\right) = 0.$$

For every round t, we define a random player pt such that 

iVp^,t(a) < for every p'G [M] (G. 6 ) 

and we say player pt is chosen for round t. For every n, we define a random player F^ such that r„ 
is the player who is chosen most frequently in the first rounds. We also let 2 be a random set 
including the rounds (in the first T rounds) in which player r„ is chosen. We let random variable 
T„ be the arm that is pulled most frequently by r„ in those rounds in 2„ and let be the last 
round in 2„ that F^ pulls T„. With all the definitions above, we have 






> 




as well as 


> 


MK - MK' 


MK 


Note that A 00 implies A 00 , which by (IG. 8 I 1 in turn implies 


T'n A CX) 


(G.7) 

(G. 8 ) 

(G.9) 


For any fixed T, clearly we have 

iV$„(a) < 




By ClaimfTTIand ( IG. 6 I 1 . 


(a) < 


Nt{a) < ^ , for every f < 

r'a* j 


^{ya 1 ^a* ) 

for every f < 


^{ya ^ ^a* ) 

Furthermore, by the definitions of random variables F„, T'ji, and T„, 


N^^{a) < 

Hence we have 

lim Pr ( (a) < 


(1 - .5) 

a 7 ^a* ) 

(1 - 


< —PT-^ 


< 


^{ya 7 ^a* ) 

{^y^AA < A > BA,yA) 

< lim Pr ^ (T„) > /iT„ + e) 

n—^oo y X Ti,^n y 


M 


+ lim Pr (a) < pa - e) 

n—>00 


p=l 


+ lim Pr ( /C(pq — e, y* + e) > 


(1 - , 5 ) 


(G.IO) 

(G.ll) 

(G.12) 
where $\epsilon$ is to be determined later. By Lemma 22 and (G.7), the first term (G.10) is 0. By Lemma 23
and (G.9), the second term (G.11) is 0. For the third term (G.12), by (G.8) we have


lim Pr ( /C(/ia “ e, + e) > 


F{'^n)K.{Va,Ua^) 


< lim Pr [ /C(/ia — e, /i* + e) > 


-FI 


MK 


K.{Va,Va*) 


( 1 - ) 

Now we decide e to be a positive real number small enough such that there exists a Tq satisfying 
/C(pa — e, + e) < - Ti -^ 6very t >To. 


Thus, 

lim Pr I IC{^a ~ Cj M* + c) ^ 

n—¥oo y 

This concludes the proof. 


(l-<5)ln(f) 


< lim Pr ($„ < To) = 0. 


Appendix H Proof of Claim 11

It suffices to prove < M ■ Nt{a). In fact, 

^ + (^ - 1) ■ (Ap.t(a) - iVf(t)(a))) (by Definition) 

^ E^i (^^(‘)(«)+^ - ^^w(«))) 

< M ■ + M ■ — A«(t)(a)) 

< M ■ + M ■ (TV*(a) - iV,(t)(a)) 

= M-Ntia), 

which concludes the proof. 


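Appendix I Proof of Claim 12

If $\alpha(\mathcal{C}) = 0$, then the right hand side becomes $M \cdot N_{p,t}(a)$. By definition,

$$\begin{aligned}
N'_{p,t}(a) &\le N_{p,t}(a) + (M-1) \cdot \left(N_{p,t}(a) - N_{\ell(t)}(a)\right) \\
&\le N_{p,t}(a) + (M-1) \cdot N_{p,t}(a) \\
&= M \cdot N_{p,t}(a).
\end{aligned}$$

If $N_{p,t}(a) = 0$, then $N'_{p,t}(a) = 0$ and the bound is trivial. Now we assume $\alpha(\mathcal{C}) > 0$ and $N_{p,t}(a) > 0$.
Note that

$$\frac{N'_{p,t}(a)}{N_{p,t}(a)} \le 1 + (M-1) \cdot \min\left(1 - \frac{N_{\ell(t)}(a)}{N_{p,t}(a)},\ \frac{u(\ell(t))}{N_{p,t}(a)}\right).$$

Let $f(x) = \min\left(1 - N_{\ell(t)}(a) / x,\ u(\ell(t)) / x\right)$. We have

$$\frac{N'_{p,t}(a)}{N_{p,t}(a)} \le 1 + (M-1) \cdot \sup_{x \in (0,\infty)} f(x).$$

Since $1 - N_{\ell(t)}(a) / x$ is increasing in $(0, \infty)$ and $u(\ell(t)) / x$ is decreasing in $(0, \infty)$, $f(x)$ is
maximized where these two functions take the same value. In fact, when $x = x^* = N_{\ell(t)}(a) + u(\ell(t))$,
we have $f(x^*) = 1 - N_{\ell(t)}(a) / x^* = u(\ell(t)) / x^*$. Thus,

$$\begin{aligned}
\frac{N'_{p,t}(a)}{N_{p,t}(a)} &\le 1 + (M-1) \cdot \frac{u(\ell(t))}{N_{\ell(t)}(a) + u(\ell(t))} \\
&\le 1 + (M-1) \cdot \frac{\left(N_{\ell(t)}(a) / \alpha(\mathcal{C}) - N_{\ell(t)}(a)\right) / M}{N_{\ell(t)}(a) + \left(N_{\ell(t)}(a) / \alpha(\mathcal{C}) - N_{\ell(t)}(a)\right) / M} \\
&= \frac{M}{1 + (M-1)\,\alpha(\mathcal{C})},
\end{aligned}$$

which completes the proof.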
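The maximization step and the closing identity can be sanity-checked numerically. The sketch below is illustrative only: it assumes $u(\ell(t))$ takes its largest admissible value $(N_{\ell(t)}(a)/\alpha(\mathcal{C}) - N_{\ell(t)}(a))/M$, as suggested by the final display, and the names `f`, `claim12_bound`, `n_comm`, and `num_players` are hypothetical.

```python
def f(x: float, n_comm: float, u: float) -> float:
    """f(x) = min(1 - N_l(a) / x, u / x) from the proof of Claim 12."""
    return min(1.0 - n_comm / x, u / x)


def claim12_bound(num_players: int, alpha: float, n_comm: float) -> float:
    """Evaluate 1 + (M - 1) * f(x*) at the maximizer x* = N_l(a) + u.

    Assumes u = (N_l(a) / alpha - N_l(a)) / M, with alpha > 0 and N_l(a) > 0.
    """
    u = (n_comm / alpha - n_comm) / num_players
    x_star = n_comm + u
    return 1.0 + (num_players - 1) * f(x_star, n_comm, u)
```

With `num_players = 4`, `alpha = 0.5`, and `n_comm = 10`, the value agrees with $M / (1 + (M-1)\alpha(\mathcal{C})) = 4 / 2.5 = 1.6$. Evaluating `f` on a grid of other $x$ values never exceeds `f` at $x^*$, which matches the increasing and decreasing argument in the proof.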